Week 7 – Confidence Intervals for the Slope

This week’s reading is a compilation of Chapter 8 from ModernDive (Kim et al. 2020), Chapter 24 from Introduction to Modern Statistics (Çetinkaya-Rundel and Hardin 2023), with a smattering of my own ideas.

0.1 Reading Guide

Download the reading guide as a Word Document here

Download the reading guide as an HTML file here

0.2 Sampling Review

In the last reading, we studied the concept of sampling variation. Using the example of estimating the proportion of red balls in a bowl, we started with a “tactile” exercise where a shovel was used to draw a sample of balls from the bowl. While we could have performed an exhaustive count of all the balls in the bowl, this would have been a tedious process. So instead, we used a shovel to extract a sample of balls and used the resulting proportion that were red as an estimate. Furthermore, we made sure to mix the bowl’s contents before every use of the shovel. Because of the randomness created by the mixing, different uses of the shovel yielded different proportions red and hence different estimates of the proportion of the bowl’s balls that are red.

We then used R to mimic this “tactile” sampling process. Using our computer’s random number generator, we were able to quickly mimic the tactile sampling procedure a large number of times. Moreover, we were able to explore how different our results would be if we used different sized shovels, with 25, 50, and 100 slots. When we visualized the results of these three different shovel sizes, we saw that as the sample size increased, the variation in the estimates (\(\widehat{p}\)) decreased.

These visualizations of the repeated sampling from the bowl have a special name in Statistics – a sampling distribution. These distributions all us to study how our estimates ($) varied from one sample to another; in other words, we wanted to study the effect of sampling variation. Once we had over 1,000 different estimates, we quantified the variation of these estimates using their standard deviation, which also has a special name in Statistics – the standard error. Visually we saw the spread of the sampling distributions get narrower as the sample size increased, which was reiterated by the standard errors – the standard errors decreased as the sample size increased. This decrease in spread of the sampling distribution gives us more precise estimates that varied less around the center.

We then tied these sampling concepts to the statistical terminology and mathematical notation related to sampling. Our study population was the large bowl with \(N\) = 2400 balls, while the population parameter, the unknown quantity of interest, was the population proportion \(p\) of the bowl’s balls that were red. Since performing a census would be expensive in terms of time and energy, we instead extracted a sample of size \(n\) = 50. The point estimate, also known as a sample statistic, used to estimate \(p\) was the sample proportion \(\widehat{p}\) of these 50 sampled balls that were red. Furthermore, since the sample was obtained at random, it can be considered as unbiased and representative of the population. Thus any results based on the sample could be generalized to the population. Therefore, the proportion of the shovel’s balls that were red was a “good guess” of the proportion of the bowl’s balls that are red. In other words, we used the sample to infer about the population.

However, we acknowledged that both the tactile and virtual sampling exercises are not what one would do in real life; this was merely an activity used to study the effects of sampling variation. In a real-life situation, we would not take 1,000s of samples of size \(n\), but rather take a single representative sample that’s as large as possible. Additionally, we knew that the true proportion of the bowl’s balls that were red was 37.5%. In a real-life situation, we will not know what this value is. Because if we did, then why would we take a sample to estimate it?

So how does one quantify the effects of sampling variation when you only have a single sample to work with? You cannot directly study the effects of sampling variation when you only have one sample. One common method to study this is bootstrapping resampling, which will be the focus of the earlier sections of this reading.

Furthermore, what if we would like not only a single estimate of the unknown population parameter, but also a range of highly plausible values? For example, when you read about political polls, they tell you the percent of all Californians who support a specific measure, but in addition to this estimate they provide the poll’s “margin of error”. This margin of error can be used to construct a range of plausible values for the true percentage of people who support a specific measure. This range of plausible values is what’s known as a confidence interval, which will be the focus of the later sections of this reading.

1 Baby Birth Weights

Medical researchers may be interested in the relationship between baby birth weight and the age of the mother, so as to provide medical interventions for specific age groups if it is found that they are associated with lower birth weights.

Every year, the US releases to the public a large data set containing information on births recorded in the country. The births14[http://openintrostat.github.io/openintro/reference/births14.html] dataset is a random sample of 1,000 cases from one such dataset released in 2014.

1.1 Observed data

Figure 1 visualizes the relationship between mage and weight for this sample of 1,000 birth records.

ggplot(data = births14, 
       mapping = aes(x = mage, y = weight)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Mother's Age", 
       y = "Birth Weight of Baby (lbs)")

Figure 1: Weight of baby at birth (in lbs) as explained by mother’s age.

Table 1 displays the estimated regression coefficients for modeling the relationship between mage and weight for this sample of 1,000 birth records.

births_lm <- lm(weight ~ mage, data = births14)

get_regression_table(births_lm)
Table 1: The least squares estimates of the intercept and slope for modeling the relationship between birth weight and mother’s age.
term estimate std_error statistic p_value lower_ci upper_ci
intercept 6.793 0.208 32.651 0.000 6.385 7.201
mage 0.014 0.007 1.987 0.047 0.000 0.028

Based on these coefficients, the estimated regression equation is:

\[ \widehat{\text{birth weight}} = -3.598 + 0.279 \times \text{weeks of pregnancy}\]

Note

We will let \(\beta_1\) represent the slope of the relationship between baby birth weight and mother’s age for every baby born in the US in 2014. We will estimate \(\beta_1\) using the births14 dataset, labeling the estimate \(b_1\) (just as we did in Week 4).

Danger

A parameter is the value of the statistic of interest for the entire population.

We typically estimate the parameter using a “point estimate” from a sample of data. The point estimate is also referred to as the statistic.

1.2 Variability of the statistic

This sample of 1,000 births is only one of possibly tens of thousands of possible samples that could have been taken from the large dataset released in 2014. So, then we might wonder how different our regression equation would be if we had a different sample. There is no reason to believe that \(\beta_1\) is 0.279, but there is also no reason to believe that \(\beta_1\) is particularly far away from \(b_1 =\) 0.279.

Just this week you read about how estimates, such as \(b_1\), are prone to sampling variation – the variability from sample to sample. For example, if we took a different sample of 1,000 births, would we obtain a slope of exactly 0.279? No, that seems fairly unlikely. We might obtain a slope of 0.29 or 0.26, or even 0.35!

When we studied the effects of sampling variation, we took many samples, something that was easily done with a shovel and a bowl of red and white balls. In this case, however, how would we obtain another sample? Well, we would need to go to the source of the data—the large public dataset released in 2014—and take another random sample of 1,000 observations. Maybe we don’t have access to that original dataset of 2014 births, how could we study the effects of sampling variation using our single sample? We will do so using a technique known as bootstrap resampling with replacement, which we now illustrate.

1.3 Resampling once

Figure 2: Step 1: Write out mother’s ages and birth weights on 1,000 slips of paper representing one of the 1,000 births included in the original sample.

Step 1: Print out 1,000 identically sized slips of paper (or post-it notes) representing the sample of 1,000 babies in our sample. On each piece of paper, write the mother’s age and the birth weight of the baby. Figure 2 displays six of these such papers.

Step 2: Put the 1,000 slips of paper into a hat as seen in Figure 3.

Figure 3: Step 2: Putting 1,000 slips of paper (post-its) in a hat.

Step 3: Mix the hat’s contents and draw one slip of paper at random, as seen in Figure 4. Record the mother’s age and birth weight, as printed on the paper.

Figure 4: Step 3: Drawing one slip of paper at random.

Step 4: Put the slip of paper back in the hat! In other words, replace it as seen in Figure 5.

Figure 5: Step 4: Replacing slip of paper.

Step 5: Repeat Steps 3 and 4 a total of 999 more times, resulting in 1,000 recorded mother’s ages and baby birth weights.

What we just performed was a resampling of the original sample of 1,000 birth weights. We are not sampling 1,000 birth weights from the population of all 2014 US births as we did for our original sample. Instead, we are mimicking this process by resampling 1,000 birth weights from our original sample.

Now ask yourselves, why did we replace our resampled slip of paper back into the hat in Step 4? Because if we left the slip of paper out of the hat each time we performed Step 4, we would end up with the same 1,000 birth weights! In other words, replacing the slips of paper induces sampling variation.

Being more precise with our terminology, we just performed a resampling with replacement from the original sample of 1,000 birth weights. Had we left the slip of paper out of the hat each time we performed Step 4, this would be resampling without replacement.

Let’s study our sample of 1,000 resampled birth weights and mother’s ages. First, I’ve made a table of the number of times each observation (birth record) was resampled. Table 2 displays the birth records that were resampled the most often. Based on the table, it appears that record IDs 71 and 534 were resampled six times.

Table 2: Frequencies of how often a given birth record (ID) was resampled.
71 534 170 561 19 142 305 309 495 573 726 727 835 838 840 864 919 924 929 953
6 6 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
Remember

When sampling with replacement, not every observation is guaranteed to be sampled. So, some observations from the original sample may never appear in the resample, whereas others may appear multiple times.

Figure 6 compares the relationship between mother’s age and baby birth weight from our resample with the relationship in our original sample.

ggplot(data = births14, 
       mapping = aes(x = mage, y = weight)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Mother's Age", 
       y = "Birth Weight (lbs)")

ggplot(data = resample1, 
       mapping = aes(x = mage, y = weight)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Mother's Age", 
       y = "Birth Weight (lbs)")

(a) Original Sample of 1,000 Birth Records

(b) Resample of 1,000 Birth Records

Figure 6: Comparing relationship between mage and weight in the resampled birth records compared to the relationship seen in the original sample of birth records.

Observed in Figure 6 that while the general shape of both relationships are similar, they are not identical.

Recall, from the previous section that the sample slope (\(b_1\)) from the original sample was . What about for our resample? Based on the scatterplot, what would your guess be? Larger than before? Smaller than before?

Let’s look at the coefficient estimates for the resampled dataset. Table 3 displays the estimated regression coefficients for modeling the relationship between mage and weight for the resample of 1,000 birth records.

resample_lm <- lm(weight ~ mage, data = resample1)

get_regression_table(resample_lm)
Table 3: The least squares estimates of the intercept and slope for modeling the relationship between birth weight and mother’s age, for the resample of 1,000 birth records.
term estimate std_error statistic p_value lower_ci upper_ci
intercept 6.981 0.220 31.738 0.000 6.549 7.413
mage 0.005 0.008 0.684 0.494 -0.010 0.020

For the resampled dataset, the relationship between mother’s age and baby birth weight is much weaker than in the original sample, with an estimated slope of \(b_1 =\) 0.005.

What if we repeated this resampling exercise many times? Would we obtain the same slope each time? In other words, would our guess at the slope for the relationship between mother’s age and birth weight for all births in the US in 2014 exactly 0.005 every time?

2 Computer simulation and resampling

It should be very clear that tactile resampling with a dataset with 1,000 observations would be extremely time consuming, nothing we would want to ask our friends to do. A computer, however, would be happy to do this process for us!

2.1 Virtually resampling once

First, let’s perform the virtual analog of resampling once. Recall that the births14 dataset included in the openintro package contains the 1,000 birth records from the original study. Furthermore, recall in the last chapter that we used the rep_sample_n() function as a virtual shovel to sample balls from our virtual bowl of 2400 balls as follows:

Let’s modify this code to perform the resampling with replacement from the 1,000 birth records in the original sample:

virtual_resample <- rep_sample_n(births14, 
                                 size = 1000,
                                 replace = TRUE, 
                                 reps = 1)

Observe how we explicitly set the replace argument to TRUE in order to tell rep_sample_n() that we would like to sample birth records with replacement. Had we not kept replace = FALSE, we would have done resampling without replacement. Additionally, we changed the sample size, as we need to create a sample with the same size as the original sample which had 1,000 observations.

Let’s look at only the first 10 out of 1,000 rows of virtual_resample:

virtual_resample
# A tibble: 1,000 × 15
# Groups:   replicate [1]
   replicate    ID  mage weight  fage mature  weeks premie visits gained lowbi…¹
       <int> <int> <dbl>  <dbl> <int> <chr>   <dbl> <chr>   <dbl>  <dbl> <chr>  
 1         1   333    29   7.4     31 younge…    39 full …     13     50 not low
 2         1   595    29   0.75    NA younge…    21 premie      4     NA low    
 3         1   338    31   6.94    NA younge…    39 full …      9     20 not low
 4         1   531    33   6.94    37 younge…    39 full …     14     20 not low
 5         1   289    35   7.98    35 mature…    38 full …     10     40 not low
 6         1   461    32   7.15    33 younge…    39 full …     13     NA not low
 7         1   137    26   7.69    36 younge…    38 full …     NA     25 not low
 8         1   828    36   4.54    65 mature…    36 premie     11     10 low    
 9         1   118    26   8.31    24 younge…    37 full …     12     25 not low
10         1    95    38   6.04    37 mature…    36 premie     10     32 not low
# … with 990 more rows, 4 more variables: sex <chr>, habit <chr>,
#   marital <chr>, whitemom <chr>, and abbreviated variable name
#   ¹​lowbirthweight

The replicate variable only takes on the value of 1 corresponding to us only having reps = 1, the ID variable indicates which of the 1,000 birth records was resampled, and mage and weight denote mother’s age and the baby’s birth weight, respectively.

Let’s now compute the slope for the relationship between mother’s age and baby’s birth weight for this resample:

virtual_resample_lm <- lm(weight ~ mage, data = virtual_resample)

get_regression_table(virtual_resample_lm)
Table 4: The least squares estimates of the intercept and slope for modeling the relationship between birth weight and mother’s age, for the virtual resample of 1,000 birth records.
term estimate std_error statistic p_value lower_ci upper_ci
intercept 7.156 0.224 31.974 0.000 6.716 7.595
mage 0.002 0.008 0.210 0.834 -0.014 0.017

As we saw when we did our tactile resampling exercise, Table 4 shows that the estimated slope is different from the slope of our original sample of 0.005.

2.2 infer package workflow

Unfortunately, our process of virtual resampling relies on us fitting a linear regression for each replicate of our virtual_resample dataset. This gets a bit tricky coding wise, as we would need to fit 35 different linear regressions if we had 35 different resamples of our data.

Enter, infer, an R package for statistical inference. infer makes efficient use of the %>% pipe operator we learned in Week 3 to spell out the sequence of steps necessary to perform statistical inference in a “tidy” and transparent fashion. Furthermore, just as the dplyr package provides functions with verb-like names to perform data wrangling, the infer package provides functions with intuitive verb-like names to perform statistical inference.

Let’s go back to our original slope. Previously, we computed the value of the sample slope \(b_1\) using the lm() function:

births_lm <- lm(weight ~ mage, data = births14)

get_regression_table(births_lm)

We’ll see that we can also do this using infer functions specify() and calculate():

virtual_resample %>% 
  specify(response = weight, 
          explanatory = mage) %>% 
  calculate(stat = "slope")

You might be asking yourself: “Isn’t the infer code longer? Why would I use that code?”. While not immediately apparent, you’ll see that there are three chief benefits to the infer workflow as opposed to the lm() function workflow we had previously.

First, the infer verb names better align with the overall resampling framework you need to understand to construct confidence intervals and to conduct hypothesis tests. We’ll see flowchart diagrams of this framework in the upcoming Figure 12.

Second, you can jump back and forth seamlessly between confidence intervals and hypothesis testing with minimal changes to your code. This will become apparent next week, when we also use infer to conduct hypothesis tests for the slope statistic.

Third, the infer workflow is much simpler for conducting inference when you have more than one variable, meaning we can extend what we’ve learned for simple linear regression to multiple linear regression models.

Let’s now illustrate the sequence of verbs necessary to construct a confidence interval for \(\b_1\), the slope of the relationship between mother’s age and baby’s birth weight for all US births in 2014.

1. specify() variables

Figure 7: “Diagram of specify()ing variables.”

As shown in Figure 7, the specify() function is used to choose which variables in a data frame will be the focus of our statistical inference. We do this by specifying the response argument. For example, in our births14 data frame, the response variable of interest is weight and the explanatory variable is mage:

births14 %>% 
  specify(response = weight, 
          explanatory = mage)
Response: weight (numeric)
Explanatory: mage (numeric)
# A tibble: 1,000 × 2
   weight  mage
    <dbl> <dbl>
 1   6.96    34
 2   8.86    31
 3   7.51    36
 4   6.19    16
 5   6.75    31
 6   6.69    26
 7   6.13    36
 8   6.74    24
 9   8.94    32
10   9.12    26
# … with 990 more rows

Notice how the dataset got smaller, now there are only two columns where before there were 14. You should also notice the messages above the dataset (Response: weight (numeric) and Explanatory: mage). These are meta-data about the grouping structure of the dataset, declaring which variable has been assigned to the explanatory / response.

Note

This is similar to how the group_by() verb from dplyr doesn’t change the data, but only adds “grouping” meta-data, as we saw in Week 3.

2. generate() replicates

Figure 8: “Diagram of generate() replicates.”

After we specify() the variables of interest, we pipe the results into the generate() function to generate replicates. Figure 8 shows how this is combined with specify() to start the pipeline. In other words, repeat the resampling process a large number of times, similar to how we collected 35 different samples of balls from the bowl.

The generate() function’s first argument is reps, which sets the number of replicates we would like to generate. Suppose we were interested in obtaining 50 different resamples (each of 1,000 birth records). Then, we would we set reps = 50, telling infer that we are interested in obtaining 50 different resamples, each with 1,000 observations.

The second argument type determines the type of computer simulation we’d like to perform. We set this to type = "bootstrap" indicating that we want to perform bootstrap resampling, meaning the resampling should be done with replacement. You’ll see different options for type when we learn about hypothesis testing next week.

births14 %>% 
  specify(response = weight, 
          explanatory = mage) %>% 
  generate(reps = 50, 
           type = "bootstrap")
Response: weight (numeric)
Explanatory: mage (numeric)
# A tibble: 50,000 × 3
# Groups:   replicate [50]
   replicate weight  mage
       <int>  <dbl> <dbl>
 1         1  10.6     31
 2         1   4.31    31
 3         1   4.94    33
 4         1   5.65    36
 5         1   8       23
 6         1   7.63    37
 7         1   8       19
 8         1   3.22    30
 9         1   8.56    32
10         1   4.32    21
# … with 49,990 more rows

Observe that the resulting data frame has 50,000 rows. This is because we performed resampling of 1000 birth records with replacement 50 times and 1000 \(\times\) 50 = 50,000.

The variable replicate indicates which resample each row belongs to. So it has the value 1 1000 times, the value 2 1000 times, all the way through to the value 50 1000 times.

Comparing with original workflow: Note that the steps of the infer workflow so far produce the same results as the original workflow using the rep_sample_n() function we saw earlier. In other words, the following two code chunks produce similar results:

infer workflow

births14 %>%
  specify(response = weight, 
          explanatory = mage) %>% 
  generate(reps = 50, 
           type = "bootstrap") 

Original workflow

rep_sample_n(births14, 
             size = 1000, 
             replace = TRUE,
             reps = 50)

3. calculate() summary statistics

Figure 9: Diagram of calculate()d summary statistics.

After we generate() many replicates of bootstrap resampling with replacement, we next want to summarize each of the 50 resamples of size 1000 to a single sample statistic value. As seen in Figure 9, the calculate() function does this.

In our case, we want to calculate the slope between mother’s age and baby’s birth weight for each bootstrap resample of size 1000. To do so, we set the stat argument to "slope".

Tip

You can also set the stat argument to a variety of other common summary statistics, like "median", "sum", "sd" (standard deviation), and "prop" (proportion). To see a list of all possible summary statistics you can use, type ?calculate and read the help file.

Let’s save the result in a data frame called bootstrap_distribution and explore its contents:

bootstrap_distribution <- births14 %>% 
  specify(response = weight, 
          explanatory = mage) %>% 
  generate(reps = 50) %>% 
  calculate(stat = "slope")

bootstrap_distribution
Response: weight (numeric)
Explanatory: mage (numeric)
# A tibble: 50 × 2
   replicate    stat
       <int>   <dbl>
 1         1 0.0328 
 2         2 0.0181 
 3         3 0.0123 
 4         4 0.0106 
 5         5 0.0183 
 6         6 0.0190 
 7         7 0.0145 
 8         8 0.00824
 9         9 0.0155 
10        10 0.0160 
# … with 40 more rows

Observe that the resulting data frame has 50 rows and two columns. The replicate column corresponds to the 50 replicates and the stat column corresponds to the estimated slope for each resample.

4. visualize() the results

Figure 10: Diagram of closing the entire process by visualize()ing the results.

The visualize()verb provides a quick way to visualize the bootstrap distribution as a histogram of the numerical stat variable’s values. The pipeline of the main infer verbs used for exploring bootstrap distribution results is shown in Figure 11.

visualize(bootstrap_distribution)

Figure 11: Bootstrap distribution of slope statistics from 50 bootstrap resamples.

Tip

The visualize() function can take many other arguments which we’ll see momentarily to customize the plot further. It also works with helper functions to do the shading of the histogram values corresponding to the confidence interval values.

Comparing with original workflow: In fact, visualize() is a wrapper function for the ggplot() function that uses a geom_histogram() layer. That’s a lot of fancy language which means that the visualize() function does the same thing as we did previously with ggplot(), just with fewer steps.

infer workflow

visualize(bootstrap_distribution)

Original workflow

ggplot(data = bootstrap_distribution,
       mapping = aes(x = stat)) +
  geom_histogram()
Danger

Careful! It might sound tempting to ditch the ggplot() code altogether now that you know of a simpler approach. The visualize() function only works for a specific case – a data frame containing a distribution of calculate()d statistics.

Let’s recap the steps of the infer workflow for constructing a bootstrap distribution and then visualizing it in Figure 12.

Figure 12: infer package workflow for resampling

2.3 Virtually resampling 1,000 times

To change from resampling 50 times to resampling 1,000 times, only one line of code needs to change—the number of reps. In the code process below, we’ve chained together the entire infer pipeline, connecting the specify, generate, calculate, and visualize steps.

bootstrap_1000 <- births14 |> 
  specify(response = weight, 
          explanatory = mage) |> 
  generate(reps = 1000, 
           type = "bootstrap") |> 
  calculate(stat = "slope") 

visualize(bootstrap_1000) + 
  labs(x = "Bootstrap Slope Statistic", 
       y = "Bootstrap Distribution of 1,000 Resamples")

Figure 13: Bootstrap distribution with 1,000 replicates.

Changing axis labels

Notice you can still use the labs() function to change your axis labels. Notice that you still need to connect the axis labels with the plot using a + sign.

What do you notice about this distribution? What distribution does it resemble? Why is that the case?

2.4 Connection to sampling distributions

The histogram of sample slopes in Figure 13 is called the bootstrap distribution. The bootstrap distribution is an approximation to the sampling distribution of sample slopes.

If you recall back to the last chapter, a sampling distribution was a distribution of statistics from repeatedly sampling from the population. However, a bootstrap distribution is a distribution of statistics from repeatedly resampling from the sample. This may seem rather strange, how can a distribution based entirely on the sample approximate a distribution based on the population? That’s a great question!

The bootstrapping process hinges on the belief that the sample is representative of the population. Meaning, there are not systematic differences between the sample and the population. For example, if there were no young mothers in the sample of 1,000 birth records in the births14 dataset. If we have a random sample, however, then on average our sample should look very similar to our population. Meaning, the individuals in the sample can “stand in” for individuals in the population with similar characteristics. So, we can think of repeatedly resampling from our sample as a similar process as sampling from the population.

The process of randomly sampling does not guarantee that smaller groups of observations will always appear in the sample. Rather, by randomly sampling from the population, on average the representation of individuals in the sample should reflect the proportion of individuals in the population.

To me, this feels rather uncomfortable as a random sample assumes that you can generalize from the sample onto the population from which it was drawn. This means you are generalizing the observations of a few individuals onto the entire population of similar individuals. For example, if you were to collect a random sample of Cal Poly students, it is likely your sample would include very few Black students (as Cal Poly is a predominantly white institution). But, if your sample was random, statistically you could infer from your small sample of Black students onto the entire population of Black students at Cal Poly. That seems kind of iffy to me.

3 Understanding confidence intervals

Let’s start this section with an analogy involving fishing. Say you are trying to catch a fish. On the one hand, you could use a spear, while on the other you could use a net. Using the net will probably allow you to catch more fish!

In the births14 investigation, we are trying to estimate the population slope for the relationship between mother’s age and baby’s birth weight (\(\beta_1\)) for all babies born in the US in 2014. Think of the value of \(\beta_1\) as a fish.

On the one hand, we could use the appropriate point estimate/sample statistic to estimate \(\beta_1\), which we saw in Table 1 is the sample slope \(b_1\). Based on our sample of 1000 birth records, the sample slope was 0.014. Think of using this value as “fishing with a spear.”

What would “fishing with a net” correspond to? Look at the bootstrap distribution in Figure 13 once more. Between which values would you say that “most” sample slopes lie? While this question is somewhat subjective, saying that most sample slopes lie between 0 and 0.03 would not be unreasonable. Think of this interval as the “net.”

What we’ve just illustrated is the concept of a confidence interval, which I’ll abbreviate with “CI” throughout this chapter. As opposed to a point estimate / sample statistic that estimates the value of an unknown population parameter with a single value, a confidence interval gives what can be interpreted as a range of plausible values. Going back to our analogy, point estimates / sample statistics can be thought of as spears, whereas confidence intervals can be thought of as nets.

Figure 14: Analogy of difference between point estimates and confidence intervals.

Our proposed interval of 0 to 0.03 was constructed by eye and was thus somewhat subjective. We now introduce two methods for constructing such intervals in a more exact fashion: the percentile method and the standard error method.

Both methods for confidence interval construction share some commonalities. First, they are both constructed from a bootstrap distribution, as you constructed in the previous section and visualized in Figure 13.

Second, they both require you to specify the confidence level. Commonly used confidence levels include 90%, 95%, and 99%. All other things being equal, higher confidence levels correspond to wider confidence intervals, and lower confidence levels correspond to narrower confidence intervals.

3.1 Percentile method

One method to construct a 95% confidence interval is to use the middle 95% of values of the bootstrap distribution. We can do this by computing the 2.5th and 97.5th percentiles, which are 0.0002 and 0.0281, respectively. This is known as the percentile method for constructing confidence intervals.

For now, let’s focus only on the concepts behind a percentile method constructed confidence interval; we’ll show you the code that computes these values in the next section.

Let’s mark these percentiles on the bootstrap distribution with vertical lines in Figure 15. About 95% of the slope variable values in bootstrap_1000 fall between 0.0002 and 0.0281, with 2.5% to the left of the leftmost line and 2.5% to the right of the rightmost line.

Figure 15: Percentile method 95% confidence interval. Interval endpoints marked by vertical lines.

3.2 Standard error method

Recall in the last chapter we saw that if a numerical variable follows a normal distribution, or, in other words, the histogram of this variable is bell-shaped, then roughly 95% of values fall between \(\pm\) 1.96 standard deviations of the mean. Given that our bootstrap distribution based on 1000 resamples with replacement in Figure 13 is normally shaped, let’s use this fact about normal distributions to construct a confidence interval in a different way.

First, recall the bootstrap distribution has a mean equal to 0.014. This value almost coincides exactly with the value of the sample slope \(b_1\) of our original 1000 birth records 0.014. Second, let’s compute the standard deviation of the bootstrap distribution using the bootstrap slope statistics stored in the stat column of the bootstrap_1000 data frame:

bootstrap_1000 %>% 
  summarize(SE = sd(stat))
# A tibble: 1 × 1
       SE
    <dbl>
1 0.00722

What is this value? Recall that the bootstrap distribution is an approximation to the sampling distribution. Thus, the variability of the sampling distribution may be approximated by the variability of the resampling distribution. Recall also that the standard deviation of a sampling distribution has a special name: the standard error. Putting these two facts together, we can say that 0.00722 is an approximation of the standard error of \(b_1\).

Traditional theory-based methodologies for inference also have formulas for standard errors, assuming some conditions are not violated. In this method, we are not using a formula to get our standard error, but using the standard error of the bootstrap distribution. Thus, using our 95% rule of thumb about normal distributions, we can use the following formula to determine the lower and upper endpoints of a 95% confidence interval for \(\beta_1\):

\[ b_1 \pm 1.96 \times SE = (b_1 - 1.96 \times SE, b_1 + 1.96 \times SE) \]

\[= (0.014 - 1.96 \times 0.007, 0.014 + 1.96 \times 0.007) \]

\[= (0.00028, 0.02772) \]

Note

To use the bootstrap standard error in this formula the bootstrap distribution must be bell shaped and symmetric. This is often the case with bootstrap distributions, especially those in which the original distribution of the sample is not highly skewed, however it is not always the case. Next week, we’ll go deeper into the explorations of model conditions.

Let’s now add the SE method confidence interval with dashed lines in Figure 16.

Figure 16: Comparing two 95% confidence interval methods.

We see that both methods produce nearly identical 95% confidence intervals for \(\beta_1\) with the percentile method yielding \((0.0002, 0.0281)\) while the standard error method produces \((0.0001, 0.0284)\).

Now that we’ve introduced the concept of confidence intervals and laid out the intuition behind two methods for constructing them, let’s explore the code that allows us to construct them.

4 Constructing confidence intervals

Recall how we introduced two different methods for constructing 95% confidence intervals for an unknown population parameter in Section @ref(ci-build-up): the percentile method and the standard error method. Let’s now check out the infer package code that explicitly constructs these. There are also some additional neat functions to visualize the resulting confidence intervals built-in to the infer package!

Percentile method with infer

Standard error method with infer

These are automatically calculated when level is provided with level = 0.95 being the default. (95% of the values in a standard normal distribution fall within 1.96 standard deviations of the mean, so \(multiplier = 1.96\) for level = 0.95, for example.)

This \(\bar{x} \pm (multiplier * SE)\) formula is implemented in the get_confidence_interval() function as shown with our pennies problem using the bootstrap distribution’s variability as an approximation for the sampling distribution’s variability. We’ll see more on this approximation shortly. Note that the center of the confidence interval (the point_estimate) must be provided for the standard error confidence interval.

5 Interpreting confidence intervals

5.1 Did the net capture the fish?

5.2 Precise and shorthand interpretation

5.3 Withd of confidence intervals

Çetinkaya-Rundel, Mine, and Jo Hardin. 2023. Introduction to Modern Statistics. OpenIntro. https://openintro-ims.netlify.app/.
Kim, Albert Y., Chester Ismay, Starry Zhou, Marium Tapal, Dorothy Bishop, WatanabeSmith, Tobias Rockel, et al. 2020. Moderndive/ModernDive_book: Cleaned Bookdown Code, Typo Fixes, New Tips & Tricks Page. Zenodo. https://doi.org/10.5281/ZENODO.3932023.